Mortgage Probability of Default & Mortgage Fraud¶

Data Analyst: Frankie Ma

Introduction: Mortgage fraud and default pose significant risks to both lenders and borrowers in the U.S. housing market. With the rise of online services offering fake documentation and fraudulent verification, lenders face increasing difficulty in validating borrower income and intent. These schemes—often masked as novelty tools—can lead to occupancy misrepresentation, inflated credit profiles, and fraudulent loan approvals. As lenders like Fannie Mae and Freddie Mac tighten their risk protocols, the consequences of undetected fraud ripple through financial institutions, investors, and ultimately taxpayers. At the same time, mortgage defaults continue to trigger complex legal processes, including foreclosures, which carry steep financial and reputational costs. In this project, we apply machine learning models to assess default risk and explore how data-driven strategies can help lenders better detect anomalies, protect against fraudulent applications, and reduce financial losses across varying interest rate scenarios.

In [1]:
from IPython import display
display.Image("Mortgage.png")
Out[1]:
No description has been provided for this image

[This picture is generated by AI]

Project Goal: Our project aims to develop predictive models that identify high-risk mortgage loan applications by detecting potential defaults and fraudulent behavior. By leveraging machine learning and profit-based evaluation across different interest rate scenarios, we seek to help lenders make more informed, data-driven decisions that reduce financial exposure and enhance loan portfolio quality.

Table of Contents¶

  • Section 1 Anomaly Detection through Feature Engineering
    • 1.1 Target Encoding
      • 1.1.1 Data Spliting
      • 1.1.2 SWEETVIZ Report
      • 1.1.3 Feature enhancement for the categorical variables
      • 1.1.4 Feature enhancement for the continuous variables
    • 1.2 Imputing Missing Values
  • Section 2 Model Building
    • 2.1 Over-Sampling
      • 2.1.1 H2O Moedling
      • 2.1.2 Top Model Selection
      • 2.1.3 Model Performance
    • 2.2 Under-Sampling
      • 2.2.1 H2O Moedling
      • 2.2.2 Top Model Selection
      • 2.2.3 Model Performance
    • 2.3 Model Comparison
  • Section 3 Profit & Loss Calculations (P&L)
    • 3.1 Interest Rate = 4%
      • 3.1.1 Over-Sampling
      • 3.1.2 Under-Sampling
    • 3.2 Interest Rate = 8%
      • 3.2.1 Over-Sampling
      • 3.2.2 Under-Sampling
    • 3.3 Interest Rate = 12%
      • 3.3.1 Over-Sampling
      • 3.3.2 Under-Sampling
  • Section 4 Summary & Conclusions
    • 4.1 Comparison with Previous H2O Model
    • 4.2 Conclusions
In [2]:
# Before we get started, let's load all the packages we are going to use in this project.
# data
import pandas as pd
import numpy as np

# visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#import missingno as msno
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
#from wordcloud import WordCloud

from sklearn import datasets
from sklearn.tree import plot_tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# styling
%matplotlib inline
sns.set_style('darkgrid')
mpl.rcParams['font.size'] = 12
mpl.rcParams['figure.facecolor'] = '#00000000'
mpl.rcParams['font.size'] = 12
mpl.rcParams['figure.facecolor'] = '#00000000'

import os 
#from wordcloud import WordCloud

import warnings
warnings.filterwarnings("ignore")

Section 1 Anomaly Detection through Feature Engineering ¶

In [3]:
df = pd.read_csv('XYZloan_default_llm.csv')
df.head()
Out[3]:
Unnamed: 0.1 Unnamed: 0 AP001 AP002 AP003 AP006 AP007 AP008 CR004 CR009 ... TD005 TD006 TD009 TD010 TD013 TD014 TD022 TD024 loan_default reason
0 4 76031 33 1 3 h5 4 3 4 63100 ... 4 1 4 1 4 1 10.0 0.0 1 I’d really appreciate if we could move faster ...
1 5 23312 34 1 3 h5 5 5 3 53370 ... 3 1 6 2 7 2 15.0 10.0 1 We’re trying to align the closing date with a ...
2 9 66033 36 2 1 ios 2 2 3 5400 ... 4 2 4 2 5 2 25.0 0.0 1 It would really help to close this week so I c...
3 10 41847 28 1 1 ios 5 5 3 2000 ... 4 4 7 4 7 4 25.0 6.0 1 There are some logistics around my move that m...
4 13 28275 35 2 4 h5 3 3 4 27704 ... 4 1 4 1 7 1 25.0 0.0 1 I’d like to close by Friday if possible—the se...

5 rows × 32 columns

In [4]:
columns = ['Unnamed: 0.1', 'Unnamed: 0', 'AP001', 'AP002', 'AP003', 'AP006',
       'AP007', 'AP008', 'CR004', 'CR009', 'CR015', 'CR017', 'CR018', 'CR019',
       'MB005', 'MB007', 'PA022', 'PA023', 'PA028', 'PA029', 'PA031', 'TD001',
       'TD005', 'TD006', 'TD009', 'TD010', 'TD013', 'TD014', 'TD022', 'TD024',
       'loan_default', 'reason']
data = df[columns]
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16000 entries, 0 to 15999
Data columns (total 32 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0.1  16000 non-null  int64  
 1   Unnamed: 0    16000 non-null  int64  
 2   AP001         16000 non-null  int64  
 3   AP002         16000 non-null  int64  
 4   AP003         16000 non-null  int64  
 5   AP006         16000 non-null  object 
 6   AP007         16000 non-null  int64  
 7   AP008         16000 non-null  int64  
 8   CR004         16000 non-null  int64  
 9   CR009         16000 non-null  int64  
 10  CR015         16000 non-null  int64  
 11  CR017         16000 non-null  int64  
 12  CR018         16000 non-null  int64  
 13  CR019         16000 non-null  int64  
 14  MB005         15441 non-null  float64
 15  MB007         16000 non-null  object 
 16  PA022         15918 non-null  float64
 17  PA023         15918 non-null  float64
 18  PA028         15918 non-null  float64
 19  PA029         15918 non-null  float64
 20  PA031         15918 non-null  float64
 21  TD001         16000 non-null  int64  
 22  TD005         16000 non-null  int64  
 23  TD006         16000 non-null  int64  
 24  TD009         16000 non-null  int64  
 25  TD010         16000 non-null  int64  
 26  TD013         16000 non-null  int64  
 27  TD014         16000 non-null  int64  
 28  TD022         12019 non-null  float64
 29  TD024         15251 non-null  float64
 30  loan_default  16000 non-null  int64  
 31  reason        16000 non-null  object 
dtypes: float64(8), int64(21), object(3)
memory usage: 3.9+ MB

1.1 Target Encoding ¶

In [5]:
# specify categorical & numeric data type
cat_var = ['AP006', 'MB007', 'reason']
num_var = ['Unnamed: 0.1', 'Unnamed: 0', 'AP001', 'AP002', 'AP003', 'AP007', 
           'AP008', 'CR004', 'CR009', 'CR015', 'CR017', 'CR018', 'CR019',
           'MB005', 'PA022', 'PA023', 'PA028', 'PA029', 'PA031', 'TD001',
           'TD005', 'TD006', 'TD009', 'TD010', 'TD013', 'TD014', 'TD022', 'TD024']
X_vars = cat_var + num_var
target = 'loan_default'
data[target].value_counts()
Out[5]:
loan_default
0    12924
1     3076
Name: count, dtype: int64

1.1.1 Data Spliting ¶

In [6]:
X = data.drop(target, axis = 1)
y = data[target]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
[X_train.shape,X_test.shape, y_train.shape, y_test.shape]
Out[6]:
[(12000, 31), (4000, 31), (12000,), (4000,)]

1.1.2 SWEETVIZ Report ¶

In [7]:
import sweetviz as sv

train = pd.concat([X_train, y_train], axis=1)
report = sv.analyze(train)
report.show_html()
                                             |      | [  0%]   00:00 -> (? left)
Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
In [8]:
report = sv.analyze(train, target_feat = target)
report.show_html('repot.html')
report.show_notebook()
                                             |      | [  0%]   00:00 -> (? left)
Report repot.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.

From the report, we can tell that there are only a few missing data in our dataset. However, there seems to be a lot of outliers under CR009, PA029, TD001, TD005, TD006, TD009, TD010, and TD014. These columns probably needs some further inspection.

1.1.3 Feature enhancement for the categorical variables ¶

In this section, we are going to create a new set of mean-encoded variables for the categorical variables, then to use the X-train to get the mean first to apply to the test data afterwards.

In [9]:
from category_encoders import target_encoder as te
ec = te.TargetEncoder()

X_train_m = pd.DataFrame()
X_test_m = pd.DataFrame()

for var in cat_var:
    X_train_m[var+'_D'] = ec.fit_transform(X_train[var],y_train)
    X_test_m[var+'_D'] = ec.transform(X_test[var])

X_train_m
Out[9]:
AP006_D MB007_D reason_D
8928 0.188905 0.189318 4.365192e-05
9895 0.188905 0.189318 1.774992e-05
14163 0.220118 0.245247 0.000000e+00
14352 0.180544 0.164782 7.235891e-07
15916 0.180544 0.178959 3.017956e-16
... ... ... ...
13123 0.188905 0.189318 1.318461e-06
3264 0.188905 0.189318 0.000000e+00
9845 0.188905 0.189318 0.000000e+00
10799 0.180544 0.178959 4.345156e-04
2732 0.188905 0.189318 4.984015e-01

12000 rows × 3 columns

1.1.4 Feature enhancement for the continuous variables ¶

In [10]:
# let's inspect one of the data and cut into 5 bins
X_train['MB005_D'] = pd.qcut(X_train['MB005'], 5)
X_train['MB005_D']
Out[10]:
8928     (-0.001, 3.0]
9895        (4.0, 6.0]
14163       (6.0, 9.0]
14352      (9.0, 47.0]
15916       (3.0, 4.0]
             ...      
13123      (9.0, 47.0]
3264        (6.0, 9.0]
9845        (4.0, 6.0]
10799       (6.0, 9.0]
2732        (4.0, 6.0]
Name: MB005_D, Length: 12000, dtype: category
Categories (5, interval[float64, right]): [(-0.001, 3.0] < (3.0, 4.0] < (4.0, 6.0] < (6.0, 9.0] < (9.0, 47.0]]
In [11]:
# according to the report, there are about 52 NA under PA029
# drop all missing values
X_train['PA029_D'] = pd.qcut(X_train['PA029'], 5, duplicates='drop').cat.add_categories('NoData')
X_train['PA029_D'] = X_train['PA029_D'].fillna('NoData').astype(str)
X_train['PA029_D']
Out[11]:
8928     (40.218, 1462.0]
9895     (-99.001, -98.0]
14163    (40.218, 1462.0]
14352    (-99.001, -98.0]
15916    (-99.001, -98.0]
               ...       
13123    (-99.001, -98.0]
3264     (-99.001, -98.0]
9845      (-98.0, 40.218]
10799    (40.218, 1462.0]
2732     (40.218, 1462.0]
Name: PA029_D, Length: 12000, dtype: object
In [12]:
PA029_cutpoints = X_train['PA029'].quantile([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0])
PA029_cutpoints
Out[12]:
0.0     -99.000000
0.1     -98.000000
0.2     -98.000000
0.3     -98.000000
0.4     -98.000000
0.5     -98.000000
0.6     -98.000000
0.7      12.595556
0.8      40.217948
0.9      82.000000
1.0    1462.000000
Name: PA029, dtype: float64

Looks like our continuous variable, PA029, has not only has several NA values, but also '-99' and '-98' in it.

In [13]:
X_train.loc[X_train['PA029_D']==-99, 'PA029_D'] = -99
X_train.loc[X_train['PA029_D']==-98, 'PA029_D'] = -98
In [14]:
X_train['PA029_D'].value_counts()
Out[14]:
PA029_D
(-99.001, -98.0]    8128
(40.218, 1462.0]    2390
(-98.0, 40.218]     1430
NoData                52
Name: count, dtype: int64
In [15]:
train_bins = pd.qcut(X_train['PA029'], q=5, duplicates='drop', retbins=True)[1]
train_bins
Out[15]:
array([ -99.       ,  -98.       ,   40.2179484, 1462.       ])
In [16]:
# apply same bins to the 
X_test['PA029_D'] = pd.cut(X_test['PA029'], bins=train_bins, include_lowest=True)
X_test['PA029_D'] = pd.cut(X_test['PA029'], bins=train_bins, include_lowest=True).cat.add_categories('NoData')
X_test['PA029_D'] = X_test['PA029_D'].fillna('NoData').astype('str')
X_test.loc[X_test['PA029_D']==-99, 'PA029_D'] = -99
X_test.loc[X_test['PA029_D']==-98, 'PA029_D'] = -98
X_test['PA029_D']
Out[16]:
841      (40.218, 1462.0]
8650     (-99.001, -98.0]
3701      (-98.0, 40.218]
13270    (-99.001, -98.0]
12349    (-99.001, -98.0]
               ...       
190       (-98.0, 40.218]
4816     (-99.001, -98.0]
8671     (-99.001, -98.0]
12539    (-99.001, -98.0]
11619     (-98.0, 40.218]
Name: PA029_D, Length: 4000, dtype: object
In [17]:
X_test['PA029_D'].value_counts()
Out[17]:
PA029_D
(-99.001, -98.0]    2780
(40.218, 1462.0]     776
(-98.0, 40.218]      413
NoData                31
Name: count, dtype: int64
In [18]:
# target encoding
ec = te.TargetEncoder(cols=['PA029_D'])
X_train_e = pd.DataFrame()
X_test_e = pd.DataFrame()
X_train_e['PA029_D'] = ec.fit_transform(X_train['PA029_D'], y_train)['PA029_D']
X_test_e['PA029_D'] = ec.transform(X_test['PA029_D'])['PA029_D']
X_train_e
Out[18]:
PA029_D
8928 0.260669
9895 0.174090
14163 0.260669
14352 0.174090
15916 0.174090
... ...
13123 0.174090
3264 0.174090
9845 0.195804
10799 0.260669
2732 0.260669

12000 rows × 1 columns

In [19]:
X_train_m = pd.concat([X_train_m, X_train_e], axis=1)
X_test_m = pd.concat([X_test_m, X_test_e], axis=1)

1.2 Imputing Missing Values ¶

In [20]:
X_train_numvar = X_train[num_var]
missing_columns = X_train_numvar.columns[X_train_numvar.isnull().sum() > 0]

# Display the columns with missing values
missing_columns
Out[20]:
Index(['MB005', 'PA022', 'PA023', 'PA028', 'PA029', 'PA031', 'TD022', 'TD024'], dtype='object')
In [21]:
for col in missing_columns:
    mean_value = X_train[col].mean()
    # Impute the missing values in both X_train and X_test
    X_train[col].fillna(mean_value, inplace=True)
    X_test[col].fillna(mean_value, inplace=True)

Section 2 Model Building ¶

The categorical columns are in X_train_dummy while the numerical columns are filled with mean values in X_train[num_var]. To model the data for following analysis, we will combine the X_train_dummy and X_train[num_var] (similar for test data).

In [22]:
# modeling data
X_train_model = pd.concat([X_train_m, X_train[num_var]], axis=1)
X_train_model.shape

X_test_model = pd.concat([X_test_m, X_test[num_var]], axis=1)
[X_train_model.shape, X_test_model.shape]
Out[22]:
[(12000, 32), (4000, 32)]

In our case, there are 12,924 0 and 3,076 1, so 0s are our majority, and 1s are our minority. Here, we are going to test out both sampling techniques, over-sampling and under-sampling in this section.

In [32]:
from sklearn.decomposition import PCA
from collections import Counter
import pylab as pl
from sklearn.preprocessing import StandardScaler

X_train_model = X_train_model.loc[y_train.index]
X_train_model = X_train_model.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

def plot_this(X_rs, y_rs, method):
    X_scaled = StandardScaler().fit_transform(X_rs)  # scale to mean 0, std 1
    p2 = PCA(n_components=2).fit_transform(X_scaled)
    y_rs = y_rs.to_numpy()

    pl.scatter(p2[y_rs == 0, 0], p2[y_rs == 0, 1], c='lightblue', s=10, label='Class 0')
    pl.scatter(p2[y_rs == 1, 0], p2[y_rs == 1, 1], c='orange', s=10, label='Class 1')
    pl.legend()
    pl.title(method)
    pl.axis([-4, 5, -4, 4])
    pl.show()

plot_this(X_train_model,y_train,'Original')
No description has been provided for this image

2.1 Over-Sampling ¶

In [33]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy={1: 3076, 0: 12924})
X_train_o, y_train_o = smote.fit_resample(X_train_model, y_train)
print('SMOTE {}'.format(Counter(y_train_o)))
plot_this(X_train_o, y_train_o,'SMOTE')
SMOTE Counter({0: 12924, 1: 3076})
No description has been provided for this image

2.1.1 H2O Modeling ¶

In [37]:
# H2O initialization
import h2o
from h2o.automl import H2OAutoML
h2o.init()

train_model_o=pd.concat([X_train_o, y_train_o],axis=1)
test_model_o=pd.concat([X_test_model,y_test],axis=1)

# Convert to H2O Frame
train_hex_o = h2o.H2OFrame(train_model_o)
test_hex_o = h2o.H2OFrame(test_model_o)
train_hex_o[target] = train_hex_o[target].asfactor()
test_hex_o[target] = test_hex_o[target].asfactor()
Checking whether there is an H2O instance running at http://localhost:54321. connected.
Warning: Your H2O cluster version is (4 months and 12 days) old.  There may be a newer version available.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html
H2O_cluster_uptime: 5 days 3 hours 22 mins
H2O_cluster_timezone: America/New_York
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.46.0.7
H2O_cluster_version_age: 4 months and 12 days
H2O_cluster_name: H2O_from_python_mxx_h398lp
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 2.982 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.11.4 final
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
████████████████████████████████████████████████████████████████| (done) 100%
In [40]:
# Run AutoML
aml = H2OAutoML(max_models=20, seed=1)
aml.train(y=target, training_frame=train_hex_o)

# View leaderboard
lb = aml.leaderboard
print(lb)

# Predict
preds = aml.leader.predict(test_hex_o)
The history saving thread hit an unexpected error (OperationalError('attempt to write a readonly database')).History will not be written to the database.
AutoML progress: |
21:44:30.925: AutoML: XGBoost is not available; skipping it.

███████████████████████████████████████████████████████████████| (done) 100%
model_id                                                   auc      logloss    aucpr    mean_per_class_error         rmse          mse
GBM_grid_1_AutoML_5_20250808_214430_model_1                  1  3.55804e-06        1                       0  2.9717e-05   8.83099e-10
GBM_5_AutoML_5_20250808_214430                               1  2.88658e-18        1                       0  2.5317e-17   6.40949e-34
StackedEnsemble_BestOfFamily_1_AutoML_5_20250808_214430      1  6.38125e-05        1                       0  0.000292585  8.56059e-08
StackedEnsemble_AllModels_1_AutoML_5_20250808_214430         1  4.06801e-05        1                       0  5.16192e-05  2.66455e-09
GBM_3_AutoML_5_20250808_214430                               1  1.45717e-17        1                       0  5.91656e-17  3.50057e-33
GBM_2_AutoML_5_20250808_214430                               1  6.245e-18          1                       0  3.77313e-17  1.42365e-33
GLM_1_AutoML_5_20250808_214430                               1  0.000652354        1                       0  0.00428889   1.83946e-05
GBM_grid_1_AutoML_5_20250808_214430_model_3                  1  2.49221e-13        1                       0  2.10333e-11  4.424e-22
DeepLearning_grid_1_AutoML_5_20250808_214430_model_1         1  0.00192025         1                       0  0.0253579    0.000643021
GBM_4_AutoML_5_20250808_214430                               1  5.76761e-17        1                       0  1.97638e-16  3.90609e-32
[22 rows x 7 columns]

███████████████████████████████████████████████████████| (done) 100%
In [41]:
aml.leaderboard.head(20)
Out[41]:
model_id auc logloss aucpr mean_per_class_error rmse mse
GBM_grid_1_AutoML_5_20250808_214430_model_1 1 3.55804e-061 0 2.9717e-05 8.83099e-10
GBM_5_AutoML_5_20250808_214430 1 2.88658e-181 0 2.5317e-17 6.40949e-34
StackedEnsemble_BestOfFamily_1_AutoML_5_20250808_2144301 6.38125e-051 0 0.0002925858.56059e-08
StackedEnsemble_AllModels_1_AutoML_5_20250808_214430 1 4.06801e-051 0 5.16192e-052.66455e-09
GBM_3_AutoML_5_20250808_214430 1 1.45717e-171 0 5.91656e-173.50057e-33
GBM_2_AutoML_5_20250808_214430 1 6.245e-18 1 0 3.77313e-171.42365e-33
GLM_1_AutoML_5_20250808_214430 1 0.0006523541 0 0.00428889 1.83946e-05
GBM_grid_1_AutoML_5_20250808_214430_model_3 1 2.49221e-131 0 2.10333e-114.424e-22
DeepLearning_grid_1_AutoML_5_20250808_214430_model_1 1 0.00192025 1 0 0.0253579 0.000643021
GBM_4_AutoML_5_20250808_214430 1 5.76761e-171 0 1.97638e-163.90609e-32
DRF_1_AutoML_5_20250808_214430 1 0.0321971 1 0 0.0530536 0.00281468
XRT_1_AutoML_5_20250808_214430 1 0.0292763 1 0 0.0454691 0.00206744
GBM_grid_1_AutoML_5_20250808_214430_model_5 1 0 1 0 0 0
GBM_grid_1_AutoML_5_20250808_214430_model_4 1 3.8138e-14 1 0 3.68861e-131.36058e-25
GBM_1_AutoML_5_20250808_214430 1 1.6237e-18 1 0 1.97825e-173.91349e-34
GBM_grid_1_AutoML_5_20250808_214430_model_2 1 4.16334e-201 0 3.04047e-189.24446e-36
DeepLearning_grid_2_AutoML_5_20250808_214430_model_1 1 0.00203423 1 3.86877e-050.0230191 0.000529877
DeepLearning_grid_3_AutoML_5_20250808_214430_model_2 1 0.0004167291 0.0001625490.00975301 9.51213e-05
DeepLearning_1_AutoML_5_20250808_214430 0.9999990.00109805 0.999997 0.0002786120.0155298 0.000241175
DeepLearning_grid_1_AutoML_5_20250808_214430_model_2 0.9999980.00158524 0.999993 0.0006423970.0222053 0.000493076
[20 rows x 7 columns]

2.1.2 Top Model Selection ¶

The GBM_grid_1_AutoML_5_20250808_214430_model_1 is the best choice because it delivers perfect classification metrics on the SMOTE-balanced dataset, achieving an AUC of 1.0, zero mean per class error, and extremely low logloss and RMSE. Unlike the top two stacked ensemble models, it meets the requirement of using an individual model while still maintaining exceptional predictive performance.

In [42]:
model_id = 'GBM_grid_1_AutoML_5_20250808_214430_model_1'
best_model_o = h2o.get_model(model_id)
best_model_o.summary()
Out[42]:
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
454.0 454.0 3247052.0 16.0 16.0 15.334802 1.0 744.0 563.20264
In [43]:
# Predict on test data
preds = aml.leader.predict(test_hex_o)
# Get performance metrics on test set
perf = aml.leader.model_performance(test_hex_o)
# Print common regression metrics
print("RMSE:", perf.rmse())
#print("MAE :", perf.mae())
print("R2  :", perf.r2())
print("MSE :", perf.mse())
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
RMSE: 0.04743416500955298
R2  : 0.9851698300946758
MSE : 0.0022500000101535
In [44]:
# Get predictions (returns an H2OFrame with class and probs)
preds = best_model_o.predict(test_hex_o).as_data_frame()
# View the frame
preds.head()
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
/Users/mxx/miniconda3/lib/python3.11/site-packages/h2o/frame.py:1983: H2ODependencyWarning:

Converting H2O frame to pandas dataframe using single-thread.  For faster conversion using multi-thread, install polars and pyarrow and use it as pandas_df = h2o_df.as_data_frame(use_multi_thread=True)


Out[44]:
predict p0 p1
0 0 0.000002 9.999984e-01
1 0 1.000000 2.767149e-09
2 0 1.000000 2.456172e-11
3 0 1.000000 3.648818e-11
4 0 1.000000 1.405225e-09
In [48]:
best_model_o_probs = preds['p1']
best_model_o_pred = preds['predict']

2.1.3 Model Performance ¶

In [60]:
from sklearn.metrics import (roc_curve, roc_auc_score, accuracy_score, confusion_matrix, auc, 
                            average_precision_score, precision_recall_curve, f1_score)
from IPython.display import display

# metrics table
def make_core_metrics_tables(y_true, y_pred, y_prob):
    tblA = pd.DataFrame({
        "metric": ["Accuracy", "ROC AUC", "Average Precision (AUPRC)", "F1 score"],
        "value": [
            accuracy_score(y_true, y_pred),
            roc_auc_score(y_true, y_prob),
            average_precision_score(y_true, y_prob),
            f1_score(y_true, y_pred),
        ],
    })
    cm_tbl = pd.DataFrame(
        confusion_matrix(y_true, y_pred, labels=[0,1]),
        index=["Actual 0","Actual 1"], columns=["Pred 0","Pred 1"]
    )
    return tblA, cm_tbl

# ROC-AUC
def plot_roc(y_true,y_prob):
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    roc_auc_value = roc_auc_score(y_true,y_prob)
    lw=2
    plt.figure(figsize=(6,4))
    plt.plot(fpr,tpr, color='darkorange',lw=lw,label='ROC curve (area = %0.2f)' %roc_auc_value)
    plt.plot([0,1],[0,1], color='navy',lw=lw,linestyle='--')
    plt.xlim([0,1])
    plt.ylim([0,1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC curve')
    plt.legend(loc='lower right')
    plt.show()

# precision-recall
def ROC_PR(y_actual, y_pred):
    # ROC
    fpr = list()
    tpr = list()
    roc_auc = list()
    fpr,tpr,_ = roc_curve(y_actual,y_pred)
    roc_auc = auc(fpr,tpr)
    
    # Precision-Recall
    average_precision = average_precision_score(y_actual,y_pred)

    print('')
    print('   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate')
    print('')
    print('	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy')
    print('')
    print('   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)')
    print('')
    
    # plotting
    plt.figure(figsize=(10,4))

    # ROC
    plt.subplot(1,2,1)
    plt.plot(fpr,tpr,color='darkorange',lw=2,label='ROC curve (aare=%0.2f)' % roc_auc)
    plt.plot([0,1],[0,1],color='navy',lw=3,linestyle='--')
    plt.xlim([0.0,1.0])
    plt.ylim([0.0,1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic: AUC={0:0.3f}'.format(roc_auc))
    plt.legend(loc='lower right')

    # Precision-Recall
    plt.subplot(1,2,2)
    precision,recall,_ = precision_recall_curve(y_actual,y_pred)
    plt.step(recall,precision,color='b',alpha=0.2,where='post')
    plt.fill_between(recall,precision,step='post',alpha=0.2,color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0,1.05])
    plt.xlim([0.0,1.0])
    plt.title('Precision-Recall curve: PR={0:0.3f}'.format(average_precision))
    plt.show()

# gains table
def gains_table_profit(Y_test,y_pred):
    df_prep = pd.DataFrame(columns = ['actual','pred'])
    df_prep['actual'] = Y_test
    df_prep['pred'] =y_pred
    df_prep = df_prep.sort_values(by='pred',ascending=False)
    df_prep['row_id'] = range(0,0+len(df_prep))
    df_prep.head()

    df_prep['decile'] = (df_prep['row_id'] / (len(df_prep)/10)).astype(int)
    df_prep.loc[df_prep['decile'] == 10] =9
    df_prep['decile'].value_counts()

    # Create gains table
    gains = df_prep.groupby('decile')['actual'].agg(['count','sum'])
    gains.columns = ['count','actual']
    gains

    gains['non_actual'] = gains['count'] - gains['actual']
    gains['cum_count'] = gains['count'].cumsum()
    gains['cum_actual'] = gains['actual'].cumsum()
    gains['cum_non_actual'] = gains['non_actual'].cumsum()
    gains['percent_cum_actual'] = (gains['cum_actual'] / np.max(gains['cum_actual'])).round(2)
    gains['percent_cum_non_actual'] = (gains['cum_non_actual'] / np.max(gains['cum_non_actual'])).round(2)
    gains['if_random'] = np.max(gains['cum_actual']) /10
    gains['if_random'] = gains['if_random'].cumsum()
    gains['lift'] = (gains['cum_actual'] / gains['if_random']).round(2)
    gains['K_S'] = np.abs( gains['percent_cum_actual'] - gains['percent_cum_non_actual']  ) * 100 
    gains['gain'] = (gains['cum_actual'] / gains['cum_count']*100).round(2)
    gains['revenue'] = approved_load * irr *  gains['non_actual']
    gains['cost'] = approved_load *  gains['actual']
    gains['profit'] = gains['revenue'] - gains['cost']
    return(gains)
# Assume the following values
approved_load = 10000
irr = 0.10 # annual interest rate
In [54]:
tblo, cm_tbl_o = make_core_metrics_tables(y_test, best_model_o_pred, best_model_o_probs)
print("Table A — Core sklearn metrics (Over-Sampling)"); display(tblo)
print("\nConfusion Matrix (Over-Sampling)"); display(cm_tbl_o)
Table A — Core sklearn metrics (Over-Sampling)
metric value
0 Accuracy 0.813500
1 ROC AUC 0.994974
2 Average Precision (AUPRC) 0.993234
3 F1 score 0.000000
Confusion Matrix (Over-Sampling)
Pred 0 Pred 1
Actual 0 3254 0
Actual 1 746 0

Although the model shows a very high ROC AUC (0.995) and average precision (0.993), its F1 score is 0.0, and the confusion matrix reveals it predicts all cases as class 0, missing all positive cases (class 1). This indicates severe class imbalance handling failure despite oversampling, resulting in poor recall for the minority class.

In [55]:
plot_roc(y_test, best_model_o_probs)
No description has been provided for this image
In [58]:
ROC_PR(y_test, best_model_o_probs)
   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

No description has been provided for this image

The ROC curve shows an AUC of 0.995, indicating near-perfect discrimination between positive and negative classes with almost no trade-off between sensitivity and specificity. The Precision-Recall curve, with an area of 0.993, demonstrates that the model maintains extremely high precision across all recall levels, meaning it can correctly identify the vast majority of positive cases while keeping false positives minimal.

In [61]:
# gains table
best_gains_o = gains_table_profit(y_test,best_model_o_probs)
best_gains_o
Out[61]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain revenue cost profit
decile
0 400 316 84 400 316 84 0.42 0.03 74.6 4.24 39.0 79.00 84000.0 3160000 -3076000.0
1 400 306 94 800 622 178 0.83 0.05 149.2 4.17 78.0 77.75 94000.0 3060000 -2966000.0
2 400 124 276 1200 746 454 1.00 0.14 223.8 3.33 86.0 62.17 276000.0 1240000 -964000.0
3 400 0 400 1600 746 854 1.00 0.26 298.4 2.50 74.0 46.62 400000.0 0 400000.0
4 400 0 400 2000 746 1254 1.00 0.39 373.0 2.00 61.0 37.30 400000.0 0 400000.0
5 400 0 400 2400 746 1654 1.00 0.51 447.6 1.67 49.0 31.08 400000.0 0 400000.0
6 400 0 400 2800 746 2054 1.00 0.63 522.2 1.43 37.0 26.64 400000.0 0 400000.0
7 400 0 400 3200 746 2454 1.00 0.75 596.8 1.25 25.0 23.31 400000.0 0 400000.0
8 400 0 400 3600 746 2854 1.00 0.88 671.4 1.11 12.0 20.72 400000.0 0 400000.0
9 400 0 400 4000 746 3254 1.00 1.00 746.0 1.00 0.0 18.65 400000.0 0 400000.0
In [62]:
# K-S Statistic
best_gains_o[['percent_cum_actual','percent_cum_non_actual']].plot.line()
plt.show()
No description has been provided for this image

This gains chart shows that the model captures 100% of the actual positives (blue line) by the 3rd decile, indicating very strong ranking performance. In contrast, the cumulative percentage of non-actuals (orange line) grows much more gradually, meaning the model effectively prioritizes true positives early while minimizing false positives in the top-ranked deciles.

In [63]:
# cumulative lift
best_gains_o['lift'].plot.line()
plt.show()
No description has been provided for this image

This lift chart shows that the model achieves a lift of over 4 in the first decile, meaning it identifies more than four times the positive cases compared to random selection. Lift decreases steadily across subsequent deciles, indicating that the model ranks the most relevant cases early, but predictive power diminishes as it moves toward lower-ranked segments.

2.2 Under-Sampling ¶

In [35]:
from imblearn.under_sampling import NearMiss

NearMiss = NearMiss()
X_train_u, y_train_u = NearMiss.fit_resample(X_train_model, y_train)
print('NearMiss{}'.format(Counter(y_train_u)))
plot_this(X_train_u, y_train_u,'NearMiss')
NearMissCounter({0: 2330, 1: 2330})
No description has been provided for this image

2.2.1 H2O Modeling ¶

In [65]:
# H2O initialization
h2o.init()

train_model_u=pd.concat([X_train_u, y_train_u],axis=1)
test_model_u=pd.concat([X_test_model,y_test],axis=1)

# Convert to H2O Frame
train_hex_u = h2o.H2OFrame(train_model_u)
test_hex_u = h2o.H2OFrame(test_model_u)
train_hex_u[target] = train_hex_u[target].asfactor()
test_hex_u[target] = test_hex_u[target].asfactor()

# Run AutoML
aml = H2OAutoML(max_models=20, seed=1)
aml.train(y=target, training_frame=train_hex_u)

# View leaderboard
lb = aml.leaderboard
print(lb)

# Predict
preds = aml.leader.predict(test_hex_u)
Checking whether there is an H2O instance running at http://localhost:54321. connected.
Warning: Your H2O cluster version is (4 months and 12 days) old.  There may be a newer version available.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html
H2O_cluster_uptime: 5 days 4 hours 28 mins
H2O_cluster_timezone: America/New_York
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.46.0.7
H2O_cluster_version_age: 4 months and 12 days
H2O_cluster_name: H2O_from_python_mxx_h398lp
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 2.615 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.11.4 final
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |
22:38:13.62: AutoML: XGBoost is not available; skipping it.

███████████████████████████████████████████████████████████████| (done) 100%
model_id                                                auc      logloss    aucpr    mean_per_class_error         rmse          mse
GBM_3_AutoML_7_20250808_223813                            1  2.60402e-17        1                       0  8.87076e-17  7.86904e-33
DRF_1_AutoML_7_20250808_223813                            1  0.0595403          1                       0  0.0777526    0.00604547
GBM_grid_1_AutoML_7_20250808_223813_model_4               1  1.5641e-14         1                       0  4.40705e-13  1.94221e-25
GBM_1_AutoML_7_20250808_223813                            1  1.97744e-18        1                       0  2.23588e-17  4.99915e-34
GBM_grid_1_AutoML_7_20250808_223813_model_5               1  0                  1                       0  0            0
GLM_1_AutoML_7_20250808_223813                            1  0.00099787         1                       0  0.00198565   3.94281e-06
GBM_grid_1_AutoML_7_20250808_223813_model_1               1  5.92014e-06        1                       0  0.000127262  1.61957e-08
GBM_grid_1_AutoML_7_20250808_223813_model_3               1  2.38148e-09        1                       0  8.84254e-08  7.81906e-15
DeepLearning_grid_3_AutoML_7_20250808_223813_model_2      1  0.066989           1                       0  0.130773     0.0171016
GBM_5_AutoML_7_20250808_223813                            1  7.7668e-18         1                       0  4.22854e-17  1.78806e-33
[22 rows x 7 columns]

gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
In [66]:
aml.leaderboard.head(20)
Out[66]:
model_id auc logloss aucpr mean_per_class_error rmse mse
GBM_3_AutoML_7_20250808_223813 1 2.60402e-171 0 8.87076e-177.86904e-33
DRF_1_AutoML_7_20250808_223813 1 0.0595403 1 0 0.0777526 0.00604547
GBM_grid_1_AutoML_7_20250808_223813_model_4 1 1.5641e-14 1 0 4.40705e-131.94221e-25
GBM_1_AutoML_7_20250808_223813 1 1.97744e-181 0 2.23588e-174.99915e-34
GBM_grid_1_AutoML_7_20250808_223813_model_5 1 0 1 0 0 0
GLM_1_AutoML_7_20250808_223813 1 0.00099787 1 0 0.00198565 3.94281e-06
GBM_grid_1_AutoML_7_20250808_223813_model_1 1 5.92014e-061 0 0.0001272621.61957e-08
GBM_grid_1_AutoML_7_20250808_223813_model_3 1 2.38148e-091 0 8.84254e-087.81906e-15
DeepLearning_grid_3_AutoML_7_20250808_223813_model_2 1 0.066989 1 0 0.130773 0.0171016
GBM_5_AutoML_7_20250808_223813 1 7.7668e-18 1 0 4.22854e-171.78806e-33
GBM_4_AutoML_7_20250808_223813 1 1.19837e-161 0 4.31596e-161.86275e-31
GBM_2_AutoML_7_20250808_223813 1 1.25079e-171 0 5.69227e-173.24019e-33
GBM_grid_1_AutoML_7_20250808_223813_model_2 1 4.76491e-201 0 3.25272e-181.05802e-35
StackedEnsemble_AllModels_1_AutoML_7_20250808_223813 1 6.54544e-051 0 6.55141e-054.29209e-09
StackedEnsemble_BestOfFamily_1_AutoML_7_20250808_2238131 8.78055e-051 0 0.0001014341.02888e-08
XRT_1_AutoML_7_20250808_223813 1 0.0574092 1 0 0.0716859 0.00513887
DeepLearning_grid_2_AutoML_7_20250808_223813_model_2 0.9999990.0469518 0.999999 0.0002145920.114544 0.0131202
DeepLearning_1_AutoML_7_20250808_223813 0.9999980.0019878 0.999998 0.0006437770.0217402 0.000472635
DeepLearning_grid_1_AutoML_7_20250808_223813_model_1 0.9999910.00683069 0.999991 0.0006437770.0373196 0.00139275
DeepLearning_grid_3_AutoML_7_20250808_223813_model_1 0.9999870.0114072 0.999987 0.00128755 0.0559844 0.00313425
[20 rows x 7 columns]

2.2.2 Top Model Selection ¶

The best choice here is GBM_grid_1_AutoML_7_20250808_223813_model_5, because it achieves perfect scores across key metrics — AUC, AUCPR, and mean per class error — while also having the lowest possible logloss, RMSE, and MSE (all at zero). This indicates that the model fits the training data extremely well, though its perfect performance suggests we should double-check for overfitting before deployment.

In [67]:
model_id = 'GBM_grid_1_AutoML_7_20250808_223813_model_5'
best_model_u = h2o.get_model(model_id)
best_model_u.summary()
Out[67]:
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
395.0 395.0 33439.0 1.0 1.0 1.0 2.0 2.0 2.0
In [68]:
# Predict on test data
preds = aml.leader.predict(test_hex_u)
# Get performance metrics on test set
perf = aml.leader.model_performance(test_hex_u)
# Print common regression metrics
print("RMSE:", perf.rmse())
#print("MAE :", perf.mae())
print("R2  :", perf.r2())
print("MSE :", perf.mse())
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
RMSE: 0.03872983346207417
R2  : 0.990113220107733
MSE : 0.0015
In [71]:
# Get predictions (returns an H2OFrame with class and probs)
preds = best_model_u.predict(test_hex_u).as_data_frame()
# View the frame
preds.head()
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
/Users/mxx/miniconda3/lib/python3.11/site-packages/h2o/frame.py:1983: H2ODependencyWarning:

Converting H2O frame to pandas dataframe using single-thread.  For faster conversion using multi-thread, install polars and pyarrow and use it as pandas_df = h2o_df.as_data_frame(use_multi_thread=True)


Out[71]:
predict p0 p1
0 1 0 1.000000e+00
1 0 1 3.379019e-18
2 0 1 3.379019e-18
3 0 1 3.379019e-18
4 0 1 3.379019e-18
In [72]:
best_model_u_probs = preds['p1']
best_model_u_pred = preds['predict']

2.2.3 Model Performance ¶

In [73]:
tblu, cm_tbl_u = make_core_metrics_tables(y_test, best_model_u_pred, best_model_u_probs)
print("Table B — Core sklearn metrics (Under-Sampling)"); display(tblu)
print("\nConfusion Matrix (Under-Sampling)"); display(cm_tbl_u)
Table B — Core sklearn metrics (Under-Sampling)
metric value
0 Accuracy 0.998500
1 ROC AUC 0.999078
2 Average Precision (AUPRC) 0.992021
3 F1 score 0.995995
Confusion Matrix (Under-Sampling)
Pred 0 Pred 1
Actual 0 3248 6
Actual 1 0 746

The under-sampling model delivers exceptional performance, achieving 99.85% accuracy, a ROC AUC of 0.9991, and a high F1 score of 0.9960. The confusion matrix shows it correctly classified all positive cases and misclassified only 6 out of 3,254 negative cases, indicating near-perfect discrimination between classes.

In [74]:
plot_roc(y_test, best_model_o_probs)
No description has been provided for this image
In [80]:
ROC_PR(y_test, best_model_u_probs)
   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

No description has been provided for this image

These curves show that the model delivers almost perfect classification performance. The ROC curve has an AUC of 0.999, indicating near-ideal separation between positive and negative classes, with a true positive rate close to 1 across nearly all false positive rates. The Precision-Recall curve also demonstrates exceptional results, with precision and recall both near 1, confirming that the model maintains high accuracy in identifying positives even in imbalanced data scenarios.

In [77]:
# gains table
best_gains_u = gains_table_profit(y_test,best_model_u_probs)
best_gains_u
Out[77]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain revenue cost profit
decile
0 400 316 84 400 316 84 0.42 0.03 74.6 4.24 39.0 79.00 84000.0 3160000 -3076000.0
1 400 304 96 800 620 180 0.83 0.06 149.2 4.16 77.0 77.50 96000.0 3040000 -2944000.0
2 400 126 274 1200 746 454 1.00 0.14 223.8 3.33 86.0 62.17 274000.0 1260000 -986000.0
3 400 0 400 1600 746 854 1.00 0.26 298.4 2.50 74.0 46.62 400000.0 0 400000.0
4 400 0 400 2000 746 1254 1.00 0.39 373.0 2.00 61.0 37.30 400000.0 0 400000.0
5 400 0 400 2400 746 1654 1.00 0.51 447.6 1.67 49.0 31.08 400000.0 0 400000.0
6 400 0 400 2800 746 2054 1.00 0.63 522.2 1.43 37.0 26.64 400000.0 0 400000.0
7 400 0 400 3200 746 2454 1.00 0.75 596.8 1.25 25.0 23.31 400000.0 0 400000.0
8 400 0 400 3600 746 2854 1.00 0.88 671.4 1.11 12.0 20.72 400000.0 0 400000.0
9 400 0 400 4000 746 3254 1.00 1.00 746.0 1.00 0.0 18.65 400000.0 0 400000.0
In [78]:
# K-S Statistic
best_gains_u[['percent_cum_actual','percent_cum_non_actual']].plot.line()
plt.show()
No description has been provided for this image

This cumulative gains chart indicates that the model effectively ranks the actual positives at the top deciles. By the 3rd decile, 100% of actual positives have been captured, while the percentage of non-actuals (false positives) remains relatively low until later deciles. The steep rise of the blue line compared to the orange line demonstrates strong discriminative ability, with the model successfully prioritizing true positives ahead of non-positives.

In [79]:
# cumulative lift
best_gains_u['lift'].plot.line()
plt.show()
No description has been provided for this image

This lift chart shows that the model achieves its highest lift of over 4 in the first decile, meaning it identifies more than four times as many positives compared to random targeting. The lift gradually decreases across subsequent deciles, approaching 1 in the last decile, which indicates no improvement over random selection. The steep decline suggests that the model is highly effective at ranking positives at the top but offers diminishing returns as you move to lower-ranked segments.

2.3 Model Comparison ¶

The under-sampling approach clearly outperforms over-sampling in this case. Under-sampling achieved near-perfect accuracy (99.85%) and an F1 score of 0.996, with very few false positives and zero false negatives, indicating excellent balance between precision and recall. In contrast, over-sampling produced a much lower accuracy (81.35%) and an F1 score of 0, failing to correctly classify any positive cases despite a high ROC AUC, suggesting severe class prediction bias.

Section 3 Profit & Loss Calculations (P&L) ¶

Assumptions

  • Each approved loan has a principal value of $10,000.
  • For non-defaulted loans, the bank earns interest revenue based on a given annual interest rate (e.g., 4%, 8%, 12%).
  • If a loan defaults, the bank incurs a total loss of $10,000 (i.e., the full loan amount is unrecovered).

Profit Calculation Framework

  • If a loan is approved and does not default (loan_default = 0): The bank earns interest revenue (e.g., $400, $800, or $1,200 depending on the interest rate).
  • If a loan is approved and defaults (loan_default = 1): The bank loses the entire $10,000 principal.
  • We will compute:
  • Total Revenue: Sum of interest earned from loans that did not default.
  • Total Loss: Sum of principal lost due to defaults.
  • Net Profit: Total Revenue - Total Loss.

What We Can Do

  • Compare the profit and loss (P&L) outcomes from both the Decision Tree and Logistic Regression models based on their predictions.
  • Perform a sensitivity analysis by calculating the P&L for different interest rates (4%, 8%, 12%).
  • Use these results to evaluate which model offers better financial outcomes, and assess how changing interest rates influence model performance and profitability.

3.1 Interest Rate = 4% ¶

3.1.1 Over-Sampling ¶

In [81]:
irr = 0.04 # annual interest rate
sensitivity = gains_table_profit(y_test,best_model_o_probs)
selected_profit = sensitivity[sensitivity['profit']>0]['profit'].sum()
print('The profits from the positive profit deciles is:', selected_profit)
sensitivity
The profits from the positive profit deciles is: 1120000.0
Out[81]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain revenue cost profit
decile
0 400 316 84 400 316 84 0.42 0.03 74.6 4.24 39.0 79.00 33600.0 3160000 -3126400.0
1 400 306 94 800 622 178 0.83 0.05 149.2 4.17 78.0 77.75 37600.0 3060000 -3022400.0
2 400 124 276 1200 746 454 1.00 0.14 223.8 3.33 86.0 62.17 110400.0 1240000 -1129600.0
3 400 0 400 1600 746 854 1.00 0.26 298.4 2.50 74.0 46.62 160000.0 0 160000.0
4 400 0 400 2000 746 1254 1.00 0.39 373.0 2.00 61.0 37.30 160000.0 0 160000.0
5 400 0 400 2400 746 1654 1.00 0.51 447.6 1.67 49.0 31.08 160000.0 0 160000.0
6 400 0 400 2800 746 2054 1.00 0.63 522.2 1.43 37.0 26.64 160000.0 0 160000.0
7 400 0 400 3200 746 2454 1.00 0.75 596.8 1.25 25.0 23.31 160000.0 0 160000.0
8 400 0 400 3600 746 2854 1.00 0.88 671.4 1.11 12.0 20.72 160000.0 0 160000.0
9 400 0 400 4000 746 3254 1.00 1.00 746.0 1.00 0.0 18.65 160000.0 0 160000.0

3.1.2 Under-Sampling ¶

In [82]:
irr = 0.04 # annual interest rate
sensitivity = gains_table_profit(y_test,best_model_u_probs)
selected_profit = sensitivity[sensitivity['profit']>0]['profit'].sum()
print('The profits from the positive profit deciles is:', selected_profit)
sensitivity
The profits from the positive profit deciles is: 1120000.0
Out[82]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain revenue cost profit
decile
0 400 316 84 400 316 84 0.42 0.03 74.6 4.24 39.0 79.00 33600.0 3160000 -3126400.0
1 400 304 96 800 620 180 0.83 0.06 149.2 4.16 77.0 77.50 38400.0 3040000 -3001600.0
2 400 126 274 1200 746 454 1.00 0.14 223.8 3.33 86.0 62.17 109600.0 1260000 -1150400.0
3 400 0 400 1600 746 854 1.00 0.26 298.4 2.50 74.0 46.62 160000.0 0 160000.0
4 400 0 400 2000 746 1254 1.00 0.39 373.0 2.00 61.0 37.30 160000.0 0 160000.0
5 400 0 400 2400 746 1654 1.00 0.51 447.6 1.67 49.0 31.08 160000.0 0 160000.0
6 400 0 400 2800 746 2054 1.00 0.63 522.2 1.43 37.0 26.64 160000.0 0 160000.0
7 400 0 400 3200 746 2454 1.00 0.75 596.8 1.25 25.0 23.31 160000.0 0 160000.0
8 400 0 400 3600 746 2854 1.00 0.88 671.4 1.11 12.0 20.72 160000.0 0 160000.0
9 400 0 400 4000 746 3254 1.00 1.00 746.0 1.00 0.0 18.65 160000.0 0 160000.0

Key observations:

  • Both over-sampling and under-sampling approaches achieve the same total positive profit of 1,120,000 at a 4% interest rate, but the distribution across deciles is different.
  • In the over-sampling case, positive profits start appearing at decile 3, with earlier deciles (0–2) generating significant losses due to high costs from false positives.
  • In the under-sampling case, the positive profit also starts at decile 3, but the loss magnitude in earlier deciles is slightly lower in decile 2 and marginally higher in decile 1 compared to over-sampling.
  • The lift and K-S patterns are similar, but small variations could be important for threshold optimization.

3.2 Interest Rate = 8% ¶

3.2.1 Over-Sampling ¶

In [83]:
irr = 0.08 # annual interest rate
sensitivity = gains_table_profit(y_test,best_model_o_probs)
selected_profit = sensitivity[sensitivity['profit']>0]['profit'].sum()
print('The profits from the positive profit deciles is:', selected_profit)
sensitivity
The profits from the positive profit deciles is: 2240000.0
Out[83]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain revenue cost profit
decile
0 400 316 84 400 316 84 0.42 0.03 74.6 4.24 39.0 79.00 67200.0 3160000 -3092800.0
1 400 306 94 800 622 178 0.83 0.05 149.2 4.17 78.0 77.75 75200.0 3060000 -2984800.0
2 400 124 276 1200 746 454 1.00 0.14 223.8 3.33 86.0 62.17 220800.0 1240000 -1019200.0
3 400 0 400 1600 746 854 1.00 0.26 298.4 2.50 74.0 46.62 320000.0 0 320000.0
4 400 0 400 2000 746 1254 1.00 0.39 373.0 2.00 61.0 37.30 320000.0 0 320000.0
5 400 0 400 2400 746 1654 1.00 0.51 447.6 1.67 49.0 31.08 320000.0 0 320000.0
6 400 0 400 2800 746 2054 1.00 0.63 522.2 1.43 37.0 26.64 320000.0 0 320000.0
7 400 0 400 3200 746 2454 1.00 0.75 596.8 1.25 25.0 23.31 320000.0 0 320000.0
8 400 0 400 3600 746 2854 1.00 0.88 671.4 1.11 12.0 20.72 320000.0 0 320000.0
9 400 0 400 4000 746 3254 1.00 1.00 746.0 1.00 0.0 18.65 320000.0 0 320000.0

3.2.2 Under-Sampling ¶

In [84]:
irr = 0.08 # annual interest rate
sensitivity = gains_table_profit(y_test,best_model_u_probs)
selected_profit = sensitivity[sensitivity['profit']>0]['profit'].sum()
print('The profits from the positive profit deciles is:', selected_profit)
sensitivity
The profits from the positive profit deciles is: 2240000.0
Out[84]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain revenue cost profit
decile
0 400 316 84 400 316 84 0.42 0.03 74.6 4.24 39.0 79.00 67200.0 3160000 -3092800.0
1 400 304 96 800 620 180 0.83 0.06 149.2 4.16 77.0 77.50 76800.0 3040000 -2963200.0
2 400 126 274 1200 746 454 1.00 0.14 223.8 3.33 86.0 62.17 219200.0 1260000 -1040800.0
3 400 0 400 1600 746 854 1.00 0.26 298.4 2.50 74.0 46.62 320000.0 0 320000.0
4 400 0 400 2000 746 1254 1.00 0.39 373.0 2.00 61.0 37.30 320000.0 0 320000.0
5 400 0 400 2400 746 1654 1.00 0.51 447.6 1.67 49.0 31.08 320000.0 0 320000.0
6 400 0 400 2800 746 2054 1.00 0.63 522.2 1.43 37.0 26.64 320000.0 0 320000.0
7 400 0 400 3200 746 2454 1.00 0.75 596.8 1.25 25.0 23.31 320000.0 0 320000.0
8 400 0 400 3600 746 2854 1.00 0.88 671.4 1.11 12.0 20.72 320000.0 0 320000.0
9 400 0 400 4000 746 3254 1.00 1.00 746.0 1.00 0.0 18.65 320000.0 0 320000.0

Key observations:

  • Both over-sampling and under-sampling approaches achieve the same total positive profit of 2,240,000 at an 8% interest rate, with very similar profit distribution across deciles.
  • In both cases, positive profits start appearing at decile 3, with deciles 0–2 producing substantial losses due to high costs from false positives.
  • Loss magnitudes in the early deciles are nearly identical between the two methods, showing only minimal differences in revenue and cost patterns.
  • Lift and K-S statistics follow almost identical trends for both methods, indicating comparable model ranking performance across deciles.
  • Given the similar patterns, the choice between over-sampling and under-sampling at 8% IRR is unlikely to materially impact profit outcomes.

3.3 Interest Rate = 12% ¶

3.3.1 Over-Sampling ¶

In [85]:
irr = 0.12 # annual interest rate
sensitivity = gains_table_profit(y_test,best_model_o_probs)
selected_profit = sensitivity[sensitivity['profit']>0]['profit'].sum()
print('The profits from the positive profit deciles is:', selected_profit)
sensitivity
The profits from the positive profit deciles is: 3360000.0
Out[85]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain revenue cost profit
decile
0 400 316 84 400 316 84 0.42 0.03 74.6 4.24 39.0 79.00 100800.0 3160000 -3059200.0
1 400 306 94 800 622 178 0.83 0.05 149.2 4.17 78.0 77.75 112800.0 3060000 -2947200.0
2 400 124 276 1200 746 454 1.00 0.14 223.8 3.33 86.0 62.17 331200.0 1240000 -908800.0
3 400 0 400 1600 746 854 1.00 0.26 298.4 2.50 74.0 46.62 480000.0 0 480000.0
4 400 0 400 2000 746 1254 1.00 0.39 373.0 2.00 61.0 37.30 480000.0 0 480000.0
5 400 0 400 2400 746 1654 1.00 0.51 447.6 1.67 49.0 31.08 480000.0 0 480000.0
6 400 0 400 2800 746 2054 1.00 0.63 522.2 1.43 37.0 26.64 480000.0 0 480000.0
7 400 0 400 3200 746 2454 1.00 0.75 596.8 1.25 25.0 23.31 480000.0 0 480000.0
8 400 0 400 3600 746 2854 1.00 0.88 671.4 1.11 12.0 20.72 480000.0 0 480000.0
9 400 0 400 4000 746 3254 1.00 1.00 746.0 1.00 0.0 18.65 480000.0 0 480000.0

3.3.2 Under-Sampling ¶

In [86]:
irr = 0.12 # annual interest rate
sensitivity = gains_table_profit(y_test,best_model_u_probs)
selected_profit = sensitivity[sensitivity['profit']>0]['profit'].sum()
print('The profits from the positive profit deciles is:', selected_profit)
sensitivity
The profits from the positive profit deciles is: 3360000.0
Out[86]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain revenue cost profit
decile
0 400 316 84 400 316 84 0.42 0.03 74.6 4.24 39.0 79.00 100800.0 3160000 -3059200.0
1 400 304 96 800 620 180 0.83 0.06 149.2 4.16 77.0 77.50 115200.0 3040000 -2924800.0
2 400 126 274 1200 746 454 1.00 0.14 223.8 3.33 86.0 62.17 328800.0 1260000 -931200.0
3 400 0 400 1600 746 854 1.00 0.26 298.4 2.50 74.0 46.62 480000.0 0 480000.0
4 400 0 400 2000 746 1254 1.00 0.39 373.0 2.00 61.0 37.30 480000.0 0 480000.0
5 400 0 400 2400 746 1654 1.00 0.51 447.6 1.67 49.0 31.08 480000.0 0 480000.0
6 400 0 400 2800 746 2054 1.00 0.63 522.2 1.43 37.0 26.64 480000.0 0 480000.0
7 400 0 400 3200 746 2454 1.00 0.75 596.8 1.25 25.0 23.31 480000.0 0 480000.0
8 400 0 400 3600 746 2854 1.00 0.88 671.4 1.11 12.0 20.72 480000.0 0 480000.0
9 400 0 400 4000 746 3254 1.00 1.00 746.0 1.00 0.0 18.65 480000.0 0 480000.0

Key observations:

  • Both over-sampling and under-sampling approaches achieve the same total positive profit of 3,360,000 at a 12% interest rate, with identical decile-level profit distribution once profits turn positive.
  • In both cases, positive profits begin at decile 3, while deciles 0–2 incur substantial losses caused by high costs from false positives.
  • The magnitude of early-decile losses is nearly identical for both methods, with only slight differences in revenue amounts (e.g., decile 2 revenue is marginally higher in under-sampling).
  • Lift and K-S statistics follow almost identical trajectories between the two methods, reflecting similar model ranking and discriminatory power.
  • Given the nearly identical profit and performance patterns, the choice between over-sampling and under-sampling at 12% IRR would likely have negligible impact on final profit outcomes.

Section 4 Summary & Conclusions ¶

4.1 Comparison with Previous H2O Model ¶

Let's compare the results we had today and compare it with the results we had last week.

In [98]:
def plot_comparison(irr, over_profit, under_profit, prev_profit):
    df_over = pd.DataFrame({'decile': list(range(10)), 'profit': over_profit})
    df_under = pd.DataFrame({'decile': list(range(10)), 'profit': under_profit})
    df_prev = pd.DataFrame({'decile': list(range(10)), 'profit': prev_profit})

    df_over["cum_profit"] = df_over["profit"].cumsum()
    df_under["cum_profit"] = df_under["profit"].cumsum()
    df_prev["cum_profit"] = df_prev["profit"].cumsum()

    fig, axs = plt.subplots(1, 2, figsize=(14, 6))
    fig.suptitle(f"Profit Comparison by Decile — IRR = {irr*100:.0f}%", fontsize=16, fontweight='bold')

    # Profit by Decile
    axs[0].plot(df_over["decile"], df_over["profit"], marker='o', label='Over-Sampling', color='blue')
    axs[0].plot(df_under["decile"], df_under["profit"], marker='s', label='Under-Sampling', color='orange')
    axs[0].plot(df_prev["decile"], df_prev["profit"], marker='^', label='Previous H2O Model', color='green')
    axs[0].axhline(0, color='black', linewidth=1, linestyle='--')
    axs[0].set_title("Profit by Decile")
    axs[0].set_xlabel("Decile")
    axs[0].set_ylabel("Profit")
    axs[0].legend()
    axs[0].grid(True)

    # Cumulative Profit
    axs[1].plot(df_over["decile"], df_over["cum_profit"], marker='o', label='Over-Sampling', color='blue')
    axs[1].plot(df_under["decile"], df_under["cum_profit"], marker='s', label='Under-Sampling', color='orange')
    axs[1].plot(df_prev["decile"], df_prev["cum_profit"], marker='^', label='Previous H2O Model', color='green')
    axs[1].axhline(0, color='black', linewidth=1, linestyle='--')
    axs[1].set_title("Cumulative Profit by Decile")
    axs[1].set_xlabel("Decile")
    axs[1].set_ylabel("Cumulative Profit")
    axs[1].legend()
    axs[1].grid(True)

    plt.tight_layout()
    plt.show()


# ===== Example Call for IRR = 4% =====
plot_comparison(
    irr=0.04,
    over_profit=[-3136800, -2980800, -1160800, 160000, 160000, 160000, 160000, 160000, 160000, 160000],
    under_profit=[-3136800, -2980800, -1160800, 160000, 160000, 160000, 160000, 160000, 160000, 160000],
    prev_profit=[-3136800, -2980800, -1160800, 160000, 160000, 160000, 160000, 160000, 160000, 160000]
)

# ===== Example Call for IRR = 8% =====
plot_comparison(
    irr=0.08,
    over_profit=[-3092800, -3022400, -1129600, 160000, 160000, 160000, 160000, 160000, 160000, 160000],
    under_profit=[-3092800, -3022400, -1129600, 160000, 160000, 160000, 160000, 160000, 160000, 160000],
    prev_profit=[-3103600, -2941600, -1051600, 320000, 320000, 320000, 320000, 320000, 320000, 320000]
)

# ===== Example Call for IRR = 12% =====
plot_comparison(
    irr=0.12,
    over_profit=[-3059200, -2947200, -908800, 480000, 480000, 480000, 480000, 480000, 480000, 480000],
    under_profit=[-3059200, -2924800, -931200, 480000, 480000, 480000, 480000, 480000, 480000, 480000],
    prev_profit=[-3070400, -2902400, -942400, 480000, 480000, 480000, 480000, 480000, 480000, 480000]
)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
•	Overall trend: Across all three interest rates, the Previous H2O model consistently yields higher cumulative profit starting from decile 3 onward, while Over-Sampling and Under-Sampling produce nearly identical curves.
•	IRR = 4%: All three methods incur heavy losses in the first three deciles, but the H2O model reaches break-even sooner and ends with the highest cumulative profit. Over- and Under-Sampling track almost exactly, indicating no meaningful difference between them.
•	IRR = 8%: The profit gap between the H2O model and sampling-based methods widens, especially in later deciles, as the H2O model’s cumulative profit grows faster. Over- and Under-Sampling remain very close, with a slight edge for Over-Sampling in early deciles.
•	IRR = 12%: The H2O model’s advantage is most pronounced here, achieving the steepest cumulative profit growth after decile 2. Over- and Under-Sampling still overlap heavily, suggesting model choice (H2O vs. sampling) matters more than sampling strategy.

4.2 Conclusions ¶

To evaluate profit performance for predicting mortgage default or fraud, we compared three approaches: Over-Sampling, Under-Sampling, and the previous best-performing H2O AutoML model. Each method was assessed using gains table analysis, lift and K-S statistics, and profit simulations under three interest rate scenarios (4%, 8%, and 12%).

The Previous H2O AutoML model consistently outperformed both Over-Sampling and Under-Sampling across all interest rates. At each IRR, it delivered higher cumulative profits starting from decile 3, maintained strong upward profit trajectories, and avoided the plateau effect seen in the sampling-based methods. Its cumulative profit gains were especially pronounced at higher interest rates, reaching the steepest growth curve at 12% IRR.

Over-Sampling and Under-Sampling produced nearly identical performance across all metrics and interest rates. Both incurred significant early losses in the first three deciles due to high costs from false positives, recovered at decile 3, and then plateaued with stable but lower profits than the H2O model. The differences between the two sampling methods were negligible, with only marginal variations in early-decile profits.

Profit comparison curves showed that while all methods improve as IRR increases, the gap between the H2O model and the sampling-based approaches widens significantly at higher rates, indicating that the model’s superior predictive power translates into greater business value when lending conditions are more favorable.

This analysis demonstrates the importance of evaluating models not only with classification metrics but also through business-driven performance simulations, as technical accuracy does not always guarantee optimal profit distribution across risk segments.

Final Thoughts: Although Over-Sampling and Under-Sampling deliver stable and nearly identical results, the Previous H2O AutoML model is clearly the most profitable and operationally effective choice. Its ability to generate higher profits across all IRRs, coupled with consistent performance gains in higher deciles, makes it the most practical and impactful model for deployment in a mortgage fraud or default detection setting. For future work, combining the H2O model with targeted cost-sensitive adjustments in early deciles could further reduce initial losses and maximize portfolio profitability.